10 November, 2023

Goals of this talk

Conceptual introduction to k-means clustering

  • How it works mechanistically, and considerations for analysis
  • Application to time-series data
  • Example of a study which used the method

A few first things

This work supported by the NSF (BCS-1944773)

Repository: https://github.com/jsteffman/k-means-clustering

This will also eventually (…) be a tutorial on https://jsteffman.github.io/teaching.html.

  • R package for traditional k-means clustering: Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.

  • R package for time-series k-means clustering: Genolini, C., Alacoque, X., Sentenac, M., Arnaud, C. (2015). kml and kml3d: R Packages to Cluster Longitudinal Data. Journal of Statistical Software, 65(4), 1-34. URL http://www.jstatsoft.org/v65/i04/.

  • For much more on this data see: Cole, J. & Steffman, J. & Shattuck-Hufnagel, S. & Tilsen, S. (2023) “Hierarchical distinctions in the production and perception of nuclear tunes in American English”, Laboratory Phonology 14(1). doi: https://doi.org/10.16995/labphon.9437

Imitative speech production paradigm

30 American English speakers

3 model utterances heard on each trial, from 2 different speakers

  • “She quoted Helena”, “He answered Jeremy”, “Her name is Marilyn”
  • 8 distinct “nuclear” tunes on the final word, with resynthesized F0

A new, metrically similar sentence is then produced

  • “They honored Melanie”, “He modeled harmony”, “She remained with Madelyn”
  • Instructions to imitate the melody of the utterance

F0 measures over the final word

  • Here, 30 time-normalised samples, speaker means for each of the 8 stimulus tunes (8 trajectories per speaker, 240 total)
  • Converted to ERB and scaled within speaker (normalising speaker F0 level/range differences)
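
This preprocessing can be sketched as follows, assuming the Glasberg & Moore (1990) ERB-rate formula and simple within-speaker z-scoring (the study’s exact pipeline may differ):

```python
import math

def hz_to_erb(f0_hz):
    """Convert an F0 value in Hz to the ERB-rate scale (Glasberg & Moore, 1990)."""
    return 21.4 * math.log10(0.00437 * f0_hz + 1)

def scale_within_speaker(samples):
    """z-score one speaker's pooled samples, removing F0 level/range differences."""
    mean = sum(samples) / len(samples)
    sd = (sum((x - mean) ** 2 for x in samples) / len(samples)) ** 0.5
    return [(x - mean) / sd for x in samples]

# One speaker's F0 samples (Hz), pooled across trajectories (toy values):
erb = [hz_to_erb(f) for f in [180.0, 220.0, 210.0, 150.0]]
scaled = scale_within_speaker(erb)
```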

Stimuli and Data

What 30 participants heard (left), and what they produced (right)

To begin, k-means in 2D

For illustration, trajectories are reduced to 2D using the “Tonal Center of Gravity” (TCoG)

  • Temporal TCoG: Roughly, when in normalised time is F0 the highest? Earlier for falls, later for rises
  • Frequency TCoG: Here just mean F0 (can be weighted in various ways)
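
A minimal illustration of these two measures, using an unweighted temporal TCoG (the published computation may weight samples differently):

```python
def temporal_tcog(times, f0):
    """F0-weighted average of (normalised) time: higher-F0 samples pull the
    value toward their position, so falls (early peak) score earlier than
    rises (late peak)."""
    return sum(t * f for t, f in zip(times, f0)) / sum(f0)

def frequency_tcog(f0):
    """Frequency TCoG, here simply the unweighted mean F0."""
    return sum(f0) / len(f0)

times = [i / 29 for i in range(30)]       # 30 time-normalised samples
fall = [1.0 - 0.5 * t for t in times]     # F0 high early in the word
rise = [0.5 + 0.5 * t for t in times]     # F0 high late in the word
# temporal_tcog(times, fall) lands earlier than temporal_tcog(times, rise)
```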

k-means in 2D

  1. Seeds are set (often at random, though other methods exist). Three are chosen here arbitrarily: the number of seeds will be the number of clusters.
  2. Points are allocated to clusters based on proximity to the seeds (often Euclidean distance).
  3. Cluster centroids are derived based on this first partition.
  4. Distance from the centroids is computed, and points are re-allocated.
  5. New centroids are derived from the re-allocated points.
  6. The process repeats until a solution is converged upon: no more shuffling of points.
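
The whole iterative procedure (Lloyd’s algorithm) can be sketched in a few lines; a toy illustration, not the NbClust implementation:

```python
def kmeans_2d(points, seeds, max_iter=100):
    """Lloyd's algorithm on 2-D points: allocate to nearest centroid,
    re-derive centroids, repeat until the allocation stops changing."""
    centroids = list(seeds)  # the number of seeds fixes the number of clusters
    assignment = None
    for _ in range(max_iter):
        # Allocate each point to the nearest centroid (Euclidean distance)
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:  # no more shuffling: converged
            break
        assignment = new_assignment
        # Re-derive each centroid as the mean of its cluster's points
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return assignment, centroids

# Two obvious groups; seeds chosen arbitrarily (here: two of the points)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cents = kmeans_2d(pts, seeds=[(0, 0), (10, 10)])
```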

Many k’s, many possible solutions

Cluster solution evaluation and optimization is thus a critical part of analysis

Selecting the optimal k value

  • Various criteria: General intuition is the “best” partition optimizes within and between cluster variance

  • R packages allow for comparison across multiple criteria, can also evaluate by “majority vote”

  • In addition to this “best” k value, others may be of interest as informed by predictions or theory
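
As a flavour of such criteria, here is one widely used index, the Calinski–Harabasz criterion (one of the many that NbClust compares), sketched for 1-D data for simplicity:

```python
def ch_index(points, labels, k):
    """Calinski-Harabasz criterion: ratio of between-cluster to
    within-cluster variance, each scaled by its degrees of freedom.
    Higher values indicate a better partition."""
    n = len(points)
    grand = sum(points) / n
    wss = 0.0  # within-cluster sum of squares
    bss = 0.0  # between-cluster sum of squares
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = sum(members) / len(members)
        wss += sum((p - centroid) ** 2 for p in members)
        bss += len(members) * (centroid - grand) ** 2
    return (bss / (k - 1)) / (wss / (n - k))

pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
good = ch_index(pts, [0, 0, 0, 1, 1, 1], k=2)  # the natural split
bad = ch_index(pts, [0, 0, 0, 0, 1, 1], k=2)   # a misplaced boundary
```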

Translating to time series

The key difference is in the understanding of DISTANCE, where distances are considered at each time point when comparing two trajectories
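
A minimal sketch of this trajectory-wise distance, assuming standard Euclidean distance over time-normalised samples (other distance measures are possible):

```python
def traj_distance(a, b):
    """Euclidean distance between two trajectories: squared differences
    at each (time-normalised) sample point, summed, then square-rooted."""
    assert len(a) == len(b), "trajectories must share a sampling grid"
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rise = [0.0, 0.2, 0.5, 0.9]
fall = [0.9, 0.5, 0.2, 0.0]
flat = [0.4, 0.4, 0.4, 0.4]
# A rise sits farther from a fall than from a flat trajectory
```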

Application to the data

In the data set, five was selected as the optimal number of clusters

Considering stimulus-cluster relationships

We can also consider other clustering solutions

Individual variation

Individual clustering solutions can become the object of analysis

  • Potentially related to group or individual characteristics
  • At left, a histogram showing individual level variation in terms of optimal k
  • At right, the relationship between this number and mutual information: i.e. the predictability of the relationship between stimulus tunes and clusters
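
Mutual information between stimulus tunes and cluster labels can be computed directly from their co-occurrence counts; a hypothetical sketch (the study’s exact computation may differ):

```python
import math
from collections import Counter

def mutual_information(stimuli, clusters):
    """MI (in bits) between stimulus tunes and cluster assignments:
    higher values mean clusters are more predictable from the tune heard."""
    n = len(stimuli)
    n_s = Counter(stimuli)              # marginal counts per stimulus tune
    n_c = Counter(clusters)             # marginal counts per cluster
    n_sc = Counter(zip(stimuli, clusters))  # joint counts
    mi = 0.0
    for (s, c), count in n_sc.items():
        p_joint = count / n
        mi += p_joint * math.log2(count * n / (n_s[s] * n_c[c]))
    return mi

# Perfectly predictable mapping -> 1 bit; independent labels -> 0 bits
perfect = mutual_information(['A', 'A', 'B', 'B'], [0, 0, 1, 1])
indep = mutual_information(['A', 'A', 'B', 'B'], [0, 1, 0, 1])
```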

Individual variation

k-means (left) vs. hierarchical clustering (right)

Concluding remarks

k-means offers a fairly intuitive way of assessing groupings in data

  • And is easily implemented for time-series data
  • Tests of different numbers of k can be interwoven with domain knowledge or predictions
  • Sensitive to outliers (centroids are means, which extreme values can pull), though easy to test with a range of k values

There are, however, crucial considerations

  • Unlike hierarchical clustering, for which a single dendrogram can be “cut” in different locations to yield different numbers of clusters, k-means offers many possible solutions based on the range of k’s tested
  • This makes k selection and optimization a crucial part of the k-means workflow
  • This can all happen “under the hood” in R, with various degrees of researcher intervention
  • Other considerations are shared between the methods: normalisation/scaling, distance measures, etc.

Thank you!